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Abstract 

The excluded volume occupied by protein sidechains and the requirement of high packing density 
in the protein interior should severely limit the number of sidechain conformations compatible with 
a given native backbone. To examine the relationship between sidechain geometry and sidechain 
packing, we use an all-atom Monte Carlo simulation to sample the large space of sidechain conforma- 
tions. We study three models of excluded volume and use umbrella sampling to effectively explore 
the entire space. We find that while excluded volume constraints reduce the size of conformational 
space by many orders of magnitude, the number of allowed conformations is still large. An average 
repacked conformation has 20% of its \ angles in a non-native state, a marked reduction from the 
expected 67% in the absence of excluded volume. Interestingly, well-packed conformations with up 
to 50% non-native x' s exist. The repacked conformations have native packing density as measured 
by a standard Voronoi procedure. Entropy is distributed non-uniformly over positions, and we par- 
tially explain the observed distribution using rotamer probabilities derived from the pdb database. 
In several cases, native rotamers which occur infrequently in the pdb database are seen with high 
probability in our simulation, indicating that sequence-specific excluded volume interactions can 
stabilize rotamers that are rare for a given backbone. In spite of our finding that 65% of the native 
rotamers and 85% of xi angles can be correctly predicted on the basis of excluded volume only, 
95% of positions can accomodate more than 1 rotamer in simulation. We estimate that in order 
to quench the sidechain entropy observed in the presence of excluded volume interactions, other 
interactions (hydrophobic, polar, electrostatic) must provide an additional stabilization of at least 
0.6 kT per residue in order to single out the native state. 
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Introduction 

One remarkable characteristic of proteins is that sidechains comprising the hydrophobic core are as 
closely packed as organic crystals.i In most proteins with known 3D structures, the core residues 
are rarely disordered and adopt one of a small number of alternative conformations. This almost 
unique packing is effected by a combination of steric interactions (excluded volume effects) and 
energetic stabilization (hydrophobic, polar, and charge interactions). The proportions by which 
these interactions contribute to the overall stability is unknown, but several studies suggest that 
steric and hydrophobic interactions are of primary importance.i'S 

Sidechain packing has been studied intensively for various reasons, notably: 1) It is thought to 
be a crucial piece of the protein folding puzzle;!'! 2) The selection of protein sequences through 
evolution may have been influenced by how well a sequence can be packed for a particular fold;^ 
3) Accurate packing algorithms are necessary for complete protein structure prediction; and 4) 
Existing threading and homology modeling algorithms may be significantly improved by a better 
understanding of how sidechains are stabilized in the core.l'l A plethora of computational methods 
for modeling protein sidechains and energetics exist,B!'!'0 and yet which ones best capture the 
underlying physics is unclear. This is due, in part, to the use of energy functions containing many 
different types of interactions. 

In this paper, we examine the role of excluded volume in isolation of all other types of inter- 
actions. This allows us to address the importance of geometry in determining the native state 
of protein sidechains. We use an all-atom, rotamer-based model of a protein to obtain repacked 
conformations of its interior sidechains. By choosing the most realistic representation of sidechain 
and backbone geometries, we eliminate errors encountered in coarse-grained models. 
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Because there is no consensus on how excluded volume interactions should be modeled in the 
context of proteins, we consider three different models. The simplest model treats all heavy atoms 
as hard spheres which are not allowed to overlap. The second model further restricts the sterically 
allowed space by adding a second shell around the hard spheres which tolerates only a limited 
number of overlaps throughout the entire protein. The third model uses a continuous r~ 12 potential, 
which is the repulsive contribution from a Lennard- Jones potential. Since a potential which decays 
faster than our chosen r~ 12 potential would allow more sidechain conformations, and since a slower 
decaying function is generally not used to model sterics,0 the three we have chosen should cover 
most of the possibilities as far as models featuring spherically symmetric, atom-based potentials. 

For each model, we obtain the distribution of rms values for sidechain conformations satisfying 
excluded volume constraints. This amounts to a full characterization of the ground-state ensemble 
of protein sidechains subject to excluded volume effects only. We find a vast number of significantly 
different conformations with native-like packing density that meet excluded volume constraints, 
implying that excluded volume alone cannot stabilize the native conformation of sidechains even 
when the backbone is fixed. We provide an estimate for the amount of additional stabilization that 
must be provided through interactions other than sterics. The results obtained using three different 
models are consistent with each other, suggesting that our main conclusions are model-independent. 
We also examine the correlation between our simulations and the distribution of rotamers in the 
pdb database, and discuss implications for sidechain prediction. Our main goal is to elucidate the 
role of excluded volume in establishing the ensemble of conformations compatible with the folded 
state of a protein. 
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Methods 

All- Atom Protein Representation. Three proteins were used in this study: photoactive yellow 
protein (1F9I), subtilisin (1GCI), and concanavalin (1NLS). We chose these proteins because they 
have very high resolution crystal structures, and are of different secondary-structure classes: subtil- 
isin is a 0.78 A resolution structure, and is an all a-helical protein; concanavalin (0.94 A resolution) 
is all /3-sheet; and photoactive yellow protein (f.f A resolution) is a/f3. All heavy atoms of the 
proteins present in the crystal structure were represented. Each sidechain torsion was allowed to 
take on one of three rotamer valuesS plus a random noise of ±10°. Rotamer probabilities^ were 
not used in the simulation. At all times, the atoms of the entire backbone were kept fixed in their 
original crystal structure positions. 

Models of sterics. The excluded volume interactions of protein sidechains were modeled in three 
ways: 

1) In the hard-sphere model atoms were treated as impenetrable spheres of given radii. These 
hard-sphere radii are necessarily smaller than the van der Waals (vdW) radii, and were defined 
by scaling the vdW radii by some factor a^. All atom- atom interactions at distances smaller than 
the sum of the hard-sphere radii ( "hard clashes" ) were strictly forbidden. A pair of atoms i and j 
separated by a distance r^ is said to be a hard clash if < (r (i) + r (j)), where r (i) is the 
vdW radius of atom i. The set of vdW radii determined inEl was used, was taken to be the 
largest value such that all hard clashes in the native protein conformation can be eliminated within 
the ±10° allowed at each torsion; we found that cth= 0.76 for concanavalin and photoactive yellow 
protein, and 0.77 for subtilisin. The steric energy, Uh, of a conformation in this model is given by 
the number of hard clashes. Only conformations with Uh — are sterically allowed in this model. 



6 



2) Since results could depend on the choice of ah, we explored a soft-sphere model in which 
atoms consisted of two radii - the hard-sphere radii fixed by ah, and somewhat larger, soft-sphere 
radii. Soft radii were defined by scaling vdW radii by a parameter a s , where ah < a s < 1. A soft 
clash occurs when ah (r (i) +r (j)) < < a s (r (i) + r (j)). All hard clashes were forbidden as 
before, while only a limited number of soft clashes are allowed over the entire protein. The steric 
energy, U, of a conformation is given by U = Uh + U s , where Uh is the number of hard atom-atom 
clashes, and U s is the number of soft atom-atom clashes above threshold. The threshold number 
of soft clashes allowed is a function of a s , as described in Fig. |l[ Only conformations with U = 
are sterically allowed in this model. Since van der Waals radii are strictly greater than excluded 
volume radii, as demonstrated in numerous studies of gas phase organic molecules,0 a s must be 
strictly less than 1.0, and we include a s = 1.0 in our plots only as an upper limit on the radii. 

3) The third model used the repulsive term of a Lennard- Jones potential. For each pair of atoms 
we fit the LJ potential, A/r 12 — 1/r 6 , so that the minimum coincides with the sum of the vdW radii 
taken from.0 With this model, the steric energy of a conformation is given by Ulj = ^ij/ r }j 
where the sum is taken over all pairs of atoms, and = (r (i) + r (j)) 6 /2. We considered a 
conformation to be sterically allowed if its energy was less than or equal to the energy of the native 
crystal structure. 

Monte Carlo Packing Algorithm. Initial random sidechain conformations were packed to 
a given rms interval [R m in, Rmax] by a Metropolis Monte Carlo simulation. At each step of 
the simulation, a random position was selected and changed to another rotamer. The move was 
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accepted with probability p given by 



P 



1 5H < 

exp(—6H/T) otherwise 



with 



5H = 5U+l 



C (^min R) R ^ Rmin 
Rmin R Rmax 

C (-R Rmax) R ^* Rmax 

where C/ is the steric energy, i? is the rms, and C = 10. All reported rms values were computed over 



the subset of sidechains having more than 40% occluded surface area as defined in reference 15, and 
at least 1 rotatable x angle. All heavy atoms of a sidechain, including the Cp atom, entered into 
the rms calculation as in reference [| Prolines were kept fixed in their crystal structure positions 
throughout this work. The 40% occluded positions for each protein are given below. Additionally, 
by visual inspection we found positions whose C a -Cp bonds were pointing toward the core, and/or 
whose sidechains were surrounded by other atoms on all sides. These positions are indicated in bold 
type. 



Subtilisin: 

(1) 8I,11V,17H,21L,22T,26V,27K,28V,30V 

(10) 31L,32D,33T,35I,36S,38H,40D,41L,49F,50V 

(20) 58D,62H,64T,65H,66V,69T,70I,73L,79V,80L 

(30) 82V,87E,88L,89Y,91V,92K,93V,94L,104S,105I 

(40) 109L,111W,117M,119V,121N,122L,123S,130S,133L,137V 
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(50) 139S,141T,145V,146L,147V,148V,151S,159I,160S,161Y 
(60) 165Y,169M,171V,174T,175D,183F,184S,185Q,186Y,190L 
(70) 191D,192I,193V,197V,199V,201S,202T,203Y,214T,215S 
(80) 216M,218T,22OH,221V,227L,228V,229K,232N,237N,240I 
(90) 241R,243H,244L,245K,256L,257Y,261L,262V,263N,265E 
(100)268T 

Concanavalin: 

(1) 4I,5V,7V,8E,9L,10D,11T,14N,17I 

(10) 19D,24H,25I,27I,28D,29I,31S,32V,40W,52I 

(20) 54Y,55N,61L,62S,65V,67Y,75V,79V,80D,81L 

(30) 85L,89V,90R,91V,93L,94S,96S,97T,102E,103T 

(40) 104N,105T,106I,108S,109W,110S,111F,113S,115L,117S 

(50) 128F,130F,133F,134S,140L,143Q,145D,148T,153N,154L 

(60) 156L,157T,164S,169S,170V,172R,174L,175F,179V,180H 

(70) 191F,195F,196T,197F,199I,201S,208D,210I,211A,213F 

(80) 214I,215S,216N,219S,225S,229L,230L,232L,233F 

Photoactive Yellow Protein: 
(1) 6F,11I,12E,15L,18M,22Q,23L,26L,28F 
(10) 31I,32Q,33L,34D,38N,39I,41Q,42F,43N,46E 
(20) 49I,50T,57V,61N,62F,63F,66V,70T,72S,75F 
(30) 76Y,79F,83V,88L,90T,92F,96F,105V,106K,107V 

9 



(40) 108H,109M,110K,111K,117S,118Y,119W,120V,121F,122V 



Umbrella sampling. Umbrella samplings was used to determine the number of states Q(R) 
for each of the 3 models. First, different sterically allowed conformations were collected for rms 
intervals of size 0.05 A over the entire rms range (0 — 4.0 A). The rms values of conformations 
differing only by random noise were averaged. Next, the probability to observe a conformation with 
rms R, Pi(R), was obtained for the z-th rms interval, [Rt, -R«+i]. Since the distribution of conforma- 
tions Q(R) must be a continuous function of R, Q(R) can be determined by appropriately scaling 
the probabilities pt(R) to ensure continuity at interval boundaries. Specifically, the probabilities 
Pi(R) are sequentially scaled by / Pi(Ri) , starting from % = 2. fl(R) is then obtained by 

multiplying all rescaled p^s by ft(Ri) / pi(Ri) . 

The umbrella sampling scheme can only work provided that the conformational space within 
each bin is explored evenly during simulation. In order to move the system into a particular rms 
bin, we had to use C — 10 and work at very low temperature (T = 0.005). These conditions were 
necessary due to the large amount of entropy available just outside any given bin, both in rms 
and in steric energy (the number of clashed conformations is exponentially larger than the number 
of well-packed ones). The downside is that the system may end up in a part of conformational 
space within a given bin that is disconnected from other parts of the bin, i.e. the system behaves 
non-ergodically under the conditions in which it must be sampled. The rms histograms within such 
isolated clusters are not representative of the entire space within a bin, and can therefore lead to 
large errors when the full rms histogram is assembled. To overcome this difficulty, we performed 200 
short runs, starting from random conformations, for each bin and computed the size of the cluster 
explored in each run. The cluster size is defined as the average torsional distance between all pairs 

10 



of uncorrelated conformations encountered in a run. We found that most bins had many clusters 
of various sizes. Since the number of states within a cluster should scale exponentially with its 
size, the largest cluster encountered should dominate the statistics within a given bin. We therefore 
sampled each bin by starting a run from a conformation obtained in a short run that had entered 
the large cluster. For every rms bin, we collected 10,000 states in the largest cluster, and used only 
these states when assembling the full rms histogram. 

Several tricks proved handy when trying to find the large cluster. The plot of the largest cluster's 
size over rms should be relatively smooth. If we found a cluster of size 20 at rms 0.90, a cluster 
of size 10 at rms 0.95, and a cluster of size 24 at rms 1.00, we could be certain that the 0.95 bin 
would be badly sampled. We could thus focus our efforts on bins with anomalously low cluster 
sizes. Some bins required up to 500 short runs before the large cluster was found. We observed 
that there is only one cluster at low rms values. If we started a run from the lowest rms bin, and 
allowed it to escape into a higher rms bin, the run would always end up in the largest cluster for 
that bin. This worked up to rms 1.2 because entropy was steeply increasing with rms, and one 
cluster was dominating the statistics. For rms values between 1.2 and 2.0, several clusters of similar 
sizes emerged, and thus sampling the largest one was no longer adequate. In this rms range, the 
distribution of states was relatively flat over rms, and thus the entire range could be fully sampled 
as a single large bin. For rms higher than two, the method of multiple short runs worked well. The 
distribution of clusters as rms is varied emerged as an interesting problem in itself, but was beyond 
the scope of this paper. 

CHARMM minimization of structures Using CHARMM, we minimized and added hydro- 
gen atoms to a set of 50 randomly chosen repacked conformations of each of the three proteins used 
in this study. All such minimizations, including those performed for other randomly selected packed 
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structures, did not change the rotamer states of the sidechains, but did eliminate close contacts (as 
defined by CHARMM) resulting from the addition of hydrogens. A set of 1000 repacked conforma- 
tions are available for public download at http: / / www-shakh.harvard.edu/~shimada/index.htmJ . 



Results 

Figure ^| shows the distribution of packed (sterically allowed) conformations for three proteins, 
obtained over a range of rms values using umbrella sampling. For each protein, the rms values are 
computed over the subset of partially occluded residues (see Methods), and thus only the occluded 
positions contribute to the total number of states. All other positions are free to repack, but 
are not counted in our sampling. With a s = ah (the hard-sphere model), the number of packed 
conformations of subtilisin is approximately 10 32 . As a s is increased using the soft-sphere model, 
the number of packed conformations decreases. This trend reverses when the soft radii reach the 
van der Waals size (see Discussion), at a s = 1.0, and the number of repackings is approximately 
10 19 . We emphasize that the soft-sphere model tolerates fewer clashes than the hard-sphere model, 
because it allows no hard-sphere clashes (at ah) and only a limited number of soft-sphere clashes (at 
a s ). Using the steric LJ model, the number of repackings is 10 25 . Since the number of repackings 
scales exponentially in the number of free torsions, we give the number of rotamer states per torsion 
in Table 2 for each protein, calculated from the LJ model. These numbers are obtained by taking 
the n-th root of the total number of states, where n is either the number of free torsions or residues. 
Normalizing by the number of torsions gives smaller variation over proteins than when normalizing 
by number of residues, due to compositional differences between proteins (variation in number of 
torsions per residue among proteins). 
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As a control, we applied the same sampling method to obtain the distribution of states for all 
possible sidechain conformations, packed and not packed, as shown in Figure |3|. The total number 
of random sidechain conformations for a protein with n free torsions is 3™. For all three proteins, 
the area under the random curve obtained from umbrella sampling is nlog(3) ± 2, demonstrating 
that the sampling is reliable, and that the total error is approximately 2 orders of magnitude. For 
each protein we included the distribution of states using the hard-sphere model (dashed line) in 
order to show the dramatic reduction in available conformations once excluded volume constraints 
are applied. 

In Figure |4] we show the sampling of packed conformations for subtilisin without umbrella 
constraints. The same data is shown separately over rms and over torsional distance from the 
native conformation. We see that the distributions over rms for the different steric models often 
yield two humps. When plotted over torsional distance, all models give perfect normal distributions. 
We found no correlation between rms and torsional distance in the range of rms that was sampled 
without umbrella constraints. One reason for this lack of correlation is that deviations in the xi 
angle yield much larger deviations in rms than the other \ angles. Thus, unless rms is very low, it 
yields almost no information about the torsional states of the protein. 

The packing density of a conformation is usually defined as the volume of the van der Waals 
envelope of a protein divided by its Voronoi volume. The density of a random sample of 1000 
repacked conformations of each protein was computed and found to vary by only ±1% of the native 
density. Given this minimal variation in packing density, we conclude that the requirement of having 
native-like density does not significantly reduce the number of packed sidechain conformations. In 
addition, we were able to build hydrogen atoms (using CHARMM0) on randomly selected, packed 
conformations, confirming that the conformations were not too compact. 
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While the number of packed conformations is large, it is conceivable that most repackings are 
structuraly similar to one another, and thus that the number of truly different repackings may 
be much smaller than observed. To determine how different these repacked conformations are, we 
uniquely identified each conformation by the rotamers of the partially occluded positions. The 
distance between two conformations was measured by counting the number of single-bond torsions 
which had to be rotated (by 120°) in order to obtain one conformation from the other. For example, 
for the partially occluded positions of subtilisin, having a total of 174 torsions, we expect random 
conformations to differ by 115 (= 174 x 2/3) torsion moves. 

The average torsional distance between repacked conformations (Fig. shows that the con- 
formations are different from each other. The similarity between conformations depends on the 
particular model. In general, larger soft radii lead to more similar repackings until the soft radii 
reach the van der Waals size (see Discussion). For the three proteins we examined, average con- 
formations obtained with the LJ model differ at 20%-25% of their torsions. We note that the 
differences between the soft-sphere models for a s > 0.90 and the LJ model are minimal in all three 
proteins. 

The entropy of each sidechain was computed by sampling uncorrelated states meeting the LJ 
constraints for each protein (Fig. [5]). The plot shows that entropy is not distributed evenly over 
positions. Approximately 25% of positions have nearly zero entropy, implying that they are rarely 
repacked. We note that only 5% of positions have exactly zero entropy, demonstrating that almost 
all positions can accomodate more than one rotamer. The maximum entropy possible for a sidechain 
with n torsions is nln(3) = l.ln. Since we observe positions with entropy greater than 1.1, it is 
clear that large residues can take on several different rotamers. 

To explain the uneven distribution of entropy, we tried to correlate it with various quantities. 
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The entropy of a given position showed no correlation with either crystallographic B-factors, or with 
occluded surface area. We found significant but relatively low correlation (R = 0.65) between the 
entropies we observed and entropies calculated from rotamer probabilities from the PDB database^ 
To investigate this trend further, we plotted the rotamer probabilities observed in our simulation 
versus probabilities observed in the pdb database. Specifically, for each rotamer state at every 
partially occluded position we plotted the frequency with which it was observed in our simulation, 
versus the frequency with which it occurs in the pdb database at its particular and ip backbone 
angles. 

These scatter plots for the three proteins are shown in Figure [| The plots indicate that native 
rotamers (red marks, corresponding to the rotamers observed in the crystal structure) are often 
seen with high probability in our simulations. Since we observe in simulations only a few non- 
native rotamers (blue marks) with probability greater than 90%, the low entropy positions in 
Figure ^ largely correspond to the native conformation. The pdb probabilities are obtained by 
averaging over many different folds and sequences, and thus sequence-specific effects are largely 
averaged out, leaving only the backbone <fi and ip angles as a determinant of rotamer frequency. 
Hence the simulation frequency of rotamers lying near the diagonal is explained well by the local 
backbone conformation at that position. Rotamers lying off the diagonal are being observed with 
anomalous frequency for their local backbone conformation. Because the backbone conformation 
does not explain the frequency of these positions, they must be particularly sensitive to one or 
more sequence-specific interactions. We observe several native rotamers with anomalously high 
probability (upper-left corner), indicating that sequence-specific steric interactions can completely 
stabilize the native rotamer even when it is rarely observed in most other proteins. We see a few 
non-native rotamers anomalously frequently (blue marks in the upper-left of the figure). These 
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rotamers are preferentially selected by sequence-dependent excluded volume interactions over the 
native rotamers. 

One can classify positions, loosely speaking, as either "well-predicted" if the correct native re- 
tainer is seen more than 50% of the time, or "poorly-predicted" if the native rotamer is seen less 
than 50% of the time. Non-native rotamers are plotted as squares if they belong to "well-predicted" 
positions, and circles if they belong to "poorly-predicted" positions. A non-native rotamer which 
is seen with high probability in the PDB database, may be seen with low probability in our simula- 
tion, lying in the bottom-right corner of the figures. Most points in this region are squares, that is, 
non-native rotamers which are being seen too infrequently because they belong to "well-predicted" 
positions. There are very few circles in this region indicating that, for the most part, significant de- 
viations from observed pdb rotamer probabilities are due to sequence-specific interactions stabilizing 
the native rotamer preferentially over a more popular rotamer. 

Concanavalin appears to have more scatter than the other two proteins. In particular, it has 
more native rotamers below and near the diagonal. Concanavalin is an all-/3 protein, while subtilisin 
is all-a, and photoactive yellow protein is a//3. It is likely that the a-helices of subtilisin, which act 
to compactify the fold locally, impose more stringent constraints on the allowed rotamers at a given 
position. This is manifested in our observation of fewer near-diagonal native rotamers in subtilisin 
than in concanavalin. 

To check whether simulation probabilities correlate with amino acid type, we plotted the se- 
quence identity of the native rotamers as letters in Figure |7|. We see that points lying in the 
lower left-hand corner are generally polar or charged amino acids. Native rotamers seen with high- 
probability in simulation are mostly hydrophobic. There are exceptions to both of these general 
trends. Since hydrophobic residues are on average more buried than charged residues, one might 
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suggest that the separation based on polarity is actually due to solvent-exposure. As stated above, 
however, we found no correlation between occluded surface and the simulation probabilities. In 
order to verify that this was not due to some peculiarity of the definition of occluded surface, we 
visually classified positions as buried or partially exposed using rough criteria outlined in Methods. 
The purple letters in Figure |?| correspond to buried positions (as defined in Methods), while the 
green letters correspond to partially exposed positions. We see no significant clustering of either 
color in these figures, confirming that entropy in our simulations is not governed by exposure. We 
conclude, not surprisingly, that the importance of electrostatic interactions, rather than solvent 
exposure, explains the poorly-predicted native rotamers of charged amino acids. 

We examined the correlation between the rotamer changes made at a pair of positions x and 
y by measuring the number of states lost, exp (AS"), due to interactions between x and y, where 
AS = S xy — (S x + Sy), S xy is the joint entropy, and S x and S y are the individual entropies. We 
found only 3 pairs of positions in subtilisin exhibiting any correlation (exp (AS") between 0.1 and 
0.3). Similar numbers were found in the other two proteins. Repacking at a single position can 
therefore be accomodated by the surrounding residues in many different ways. 

Given that there is a significant amount of torsional entropy in our model, it interesting to 
ask how well an average sidechain structure prediction based solely on excluded volume would 
perform. For subtilisin, we observe the rms of predictions broadly distributed between 1.4 and 1.8 
angsroms. Using either the LJ model or the a s = 0.90 model, we predict correct rotamers for 65% 
of residues, and predict xi angles correctly for 85% of positions. The best performing sidechain 
prediction algorithms, which incorporate several physical and knowledge-based terms, predict 93% 
of xi angles, so excluded volume alone performs very well in this regard. 
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Discussion 

Upon folding to its native backbone topology, the conformational space of a protein is cut by many 
orders of magnitude (Fig. due to excluded volume. At the level of a single torsion, the number of 
rotamer states is reduced by approximately a factor two. The native backbone can still accomodate 
many different sidechain conformations, while maintaining its high degree of compactness. Other 
sources of stabilization (such as attractive van der Waals interactions, hydrogen bonding, polar 
interactions) are necessary to overcome the final entropic cost associated with fixing the sidechains 
in their native conformations. 

The packing of sidechains in a native backbone has often been described as a jigsaw-puzzle: 
the shapes of the sidechains (the pieces) are enough to determine how they will fit in the protein 
interior.^ This argument has been used to explain the high density observed in proteins. The 
large number of repackings and the diversity of these conformations (Fig. |^), all with native- 
like density, suggest that a precise fitting of residues is not necessary for a high packing density. 
The rotamer frequency plots demonstrate that while there is a sequence-specific excluded volume 
effect that acts to single out the native rotamers, it is not sufficient to overcome the fact that the 
backbone generally allows several different rotamers at each position. The distribution of entropy 
shows that many positions admit significantly more than one rotamer. Furthermore, the correlation 
calculations show that a conformational change at one position does not necessitate a particular 
change at spatially proximate positions. We conclude that while sidechains must fit together to avoid 
steric clashes and maintain native-like density, these constraints alone do not give rise to precise and 
unique fitting of residues. Other energetic interactions must bias sidechain conformations toward 
the native state, leading to the observation of directional packing in protein structures. 
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The three models employed in our study yield consistent results. All three vastly reduce the 
total number of available states. The hard-shere model is consistently more forgiving than the other 
models. The soft-sphere models with a s > 0.90 all roughly agree to ±2 orders of magnitude. The LJ 
model is only slightly more tolerant than the soft-sphere models, suggesting that a two-shell model 
can adequately model sterics without introducing continuous distance dependence. The soft-sphere 
model has two limiting behaviors. For a s = ah, the soft-sphere model is identical to the hard-sphere 
model. For a s ^ 1, all atom pairs in the native structure are considered to clash, and thus the 
soft-sphere model will tolerate all possible clashes. Since the hard-sphere excluded volume is the 
only remaining constraint, the soft-sphere model is again equivalent to the hard-sphere model in 
this limit. The observed rise in the number of states as a s reaches 1.0 is the necessary turn-around 
between these two limiting behaviors. Sterics must therefore be modelled with a s < 1. When 
a s = 1.0, the radii have reached van der Waals sizes, and are in the regime of attraction rather 
than excluded volume. While increasing the radii past the hard-core value gives a significant gain 
in stability, we see that there is a limit on how much stabilization can be achieved from excluded 
volume effects. 

Our ability to predict 85% of xi angles correctly shows that excluded volume is of primary 
importance in determining the general orientation of a sidechain with respect to its backbone. Our 
prediction of only 65% of positions correctly shows that sterics are less important in determining 
the rest of the sidechain's rotameric conformation. Of particular interest are native rotamers that 
are observed in the PDB database in less than 10% of proteins. Many of these rotamers are 
observed with significantly higher probability in our simulation, while some are observed at their 
expected PDB level. The excluded volume interaction is thus exhibiting some frustration between 
the distribution of states compatible with the local backbone conformation, and the sequence- 
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specific neighborhood of the residue. Other interactions must provide significant stabilization to 
overcome this problem. 

A natural extension of our work is to add an attractive interaction to the potential which would 
stabilize the native packing. From the distribution of packed conformations, we estimate that the 
native conformation must be about 0.6/cT per sidechain (see Table 2) lower in energy than the 
average packed conformation with native backbone. This estimate can be used to guide efforts 
towards designing better potentials for sidechain packing. Furthermore, the techniques presented 
here, combined with a stepwise increase in the complexity of potentials, can determine how much 
stabilization each additional interaction provides. It will be interesting to see whether a potential 
that significantly stabilizes the native rotamers can stabilize the native fold when backbone motions 
are added to the model. 
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Figures 

Figure 1 

The number of soft clashes in near-native conformations as a function of a s for each protein studied. 
The solid line denotes the average number observed in an ensemble of near-native conformations 
generated by adding ±10° random noise to the native sidechains. The dotted line is determined 
by taking three standard deviations below the average, and corresponds to the soft clash threshold 
used in this study. 

Figure 2 

The figures in the left column show the density of packed conformations as a function of rms for 
various values of a s (indicated by the colored, solid curves). The dashed curve is the corresponding 
distribution using the repulsive Lennard- Jones model. Figures in the right column show the average 
number of bond rotations needed to bring one conformation to another as a function of rms for the 
various models. The average is taken over all pairs of 1000 randomly selected conformations found 
in a given rms window. 

Figure 3 

The distribution of random sidechain conformations for each of the three proteins (solid lines). 
Umbrella sampling over rms without any excluded volume constraints was used to construct these 
figures. The dashed lines correspond to the hard-sphere model (identical to the black curves from 
Figure 0) and are provided for comparison. 
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Figure 4 

Uncorrelated packed conformations for subtilisin sampled using the various models with no rms 
constraints are histogrammed separately over rms from the native crystal structure, and over tor- 
sional distance from the native rotamers. Each histogram is obtained over a total of 240,000 states. 
The states shown in this plot correpond to the tops of the rms histograms from Figure but were 
obtained without umbrella sampling. We therefore do not see any low rms states in these plots, 
because there are exponentially fewer states as rms is lowered. The native rotamer configuration of 
subtilisin was obtained by minimizing rms in simulation. The rms of this nearly-native rotameric 
state is 0.5 A from the crystal structure. The reason for the non-zero rms is that deviations of more 
than 10 degrees from perfect rotamers are observed in the crystal structure, and our simulations 
allows for no more than 10 degrees of tolerance around each rotameric value for a given sidechain 
X angle. 

Figure 5 

The entropy for the partially occluded positions of each protein. 50,000 uncorrelated packed con- 
formations were sampled for each protein using the repulsive LJ model without umbrella sampling. 
The probability of observing each rotamer at each position was calculated, and the entropy of each 
position was computed using S = —J^Pi^ogpi, where pi is the frequency of occurence of the i-ih 
rotamer at the given position. The histograms of these entropy values are also provided for each 
protein. 

Figure 6 

Observed simulation frequency of a rotamer vs. the PDB database frequency of that rotamer, 
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given its amino acid identity and the <fi-ip angles of its backbone. Database probabilities were taken 
from the rotamer library described in.§ Each point in the plots corresponds to a single rotamer. 
Red points are native rotamers (there is one red point for each sequence position); blue points are 
non-native rotamers (there are multiple blue points for each position). We called positions "well- 
predicted" if the native rotamer was observed more than 50% of the time, and "poorly-predicted" 
otherwise. We further classified the non-native rotamers (blue points) as squares if they belonged to 
"well-predicted" positions or circles if they belong to "poorly-predicted" positions. For example, if 
a native rotamer at a given position is observed 85% of the time, all the other, non-native rotamers 
for that position will be squares. Conversely, if at a different position the native rotamer is seen only 
15% of the time, all the non-native rotamers for that position will be circles. Simulation frequencies 
are calculated over 50,000 packed conformations obtained using the LJ model without umbrella 
sampling. 
Figure 7 

Simulation vs. PDB frequency for each native rotamer. The native rotamers for each protein are 
identified by their amino acid type (letter) and are colored purple if they are in a buried position, 
and green if they are in a partially exposed position. The buried positions are given in the methods 
section, indicated in bold type. 
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Tables 



atomic group 


r 


a h r 


C3H0 


1.61 


1.21 


C3H1 


1.76 


1.32 


C4H1 


1.88 


1.41 


C4H2 


1.88 


1.41 


C4H3 


1.88 


1.41 


N3H0 


1.64 


1.23 


N3H1 


1.64 


1.23 


N3H2 


1.64 


1.23 


N4H3 


1.64 


1.23 


O1H0 


1.42 


1.07 


02H1 


1.46 


1.10 


S2H0 


1.77 


1.33 


S2H1 


1.77 


1.33 



Table 1: 



Table 1 

The atomic groups, VdW radii (r) , and the hard core distances (a^r) used in our study. The atom 
groups and VdW radii were obtained from Table 2 of Tsai et al. (1999). The atomic group AnRm 
refers to an atom of element A with a chemical valence of n and m hydrogen atoms bonded to it. 
A methyl carbon (— CH 3 ), for example, falls under the C4H3 atomic group. 
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Tables 



protein 


# of X 's 


# of res. 


States per x/P er res - 


S per x/P er res - 


1F91 


99 


49 (125) 


1.52 / 2.33 


0.42 / 0.85 


1NLS 


151 


88 (237) 


1.46 / 1.92 


0.38 / 0.65 


1GCI 


174 


100 (269) 


1.39 / 1.78 


0.33 / 0.58 



Table 2: 

Table 2 



Statistics of repackings for photoactive yellow protein (1F9I), concanavalin (1NLS), and subtilisin 
(1GCI) using the repulsive Lennard- Jones model. The second column lists the number of torsions 
that contribute to the overall entropy. These torsions belong to residues having more than 40% 
occluded surface area. The number of occluded residues having one or more free torsions is given 
in the third column, with the total length of the protein in parentheses. The number of states and 
entropy (S) per x an d per residue are calculated over the occluded residues only. 
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